\chapter{Efficient Learning} \label{sec:opt}

\chapterquote{One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation.}{Ada Lovelace}

\begin{learningobjectives}
\item Understand and be able to implement stochastic gradient descent
  algorithms.
\item Compare and contrast small versus large batch sizes in
  stochastic optimization.
\item Derive subgradients for sparse regularizers.
\item Implement feature hashing.
\end{learningobjectives}

\dependencies{}

\newthought{So far, our focus has been on} \emph{models} of learning
and basic algorithms for those models.  We have not placed much
emphasis on how to learn \emph{quickly}.  The basic techniques you
learned about so far are enough to get learning algorithms running on
tens or hundreds of thousands of examples.  But if you want to build
an algorithm for web page ranking, you will need to deal with millions
or billions of examples, in hundreds of thousands of dimensions.  The
basic approaches you have seen so far are insufficient to achieve such
a massive scale.

In this chapter, you will learn some techniques for scaling learning
algorithms.  This are useful even when you do not have billions of
training examples, because it's always nice to have a program that
runs quickly.  You will see techniques for speeding up both model
training and model prediction.  The focus in this chapter is on linear
models (for simplicity), but most of what you will learn applies more
generally.

\section{What Does it Mean to be Fast?}

Everyone always wants fast algorithms.  In the context of machine
learning, this can mean many things.  You might want fast training
algorithms, or perhaps training algorithms that scale to very large
data sets (for instance, ones that will not fit in main memory).  You
might want training algorithms that can be easily parallelized.  Or,
you might not care about training efficiency, since it is an offline
process, and only care about how quickly your learned functions can
make classification decisions.

It is important to separate out these desires.  If you care about
efficiency at training time, then what you are really asking for are
more efficient learning algorithms.  On the other hand, if you care
about efficiency at test time, then you are asking for \emph{models}
that can be quickly evaluated.

One issue that is not covered in this chapter is parallel learning.
This is largely because it is currently not a well-understood area in
machine learning.  There are many aspects of parallelism that come
into play, such as the speed of communication across the network,
whether you have shared memory, etc.  Right now, this the general,
poor-man's approach to parallelization, is to employ ensembles.

\section{Stochastic Optimization}

During training of most learning algorithms, you consider the entire
data set simultaneously.  This is certainly true of gradient descent
algorithms for regularized linear classifiers (recall
Algorithm~\ref{alg:loss:gd}), in which you first compute a gradient
over the entire training data (for simplicity, consider the unbiased
case):
%
\begin{align}
\vg
&= \sum_n \grad_{\vw} \ell(y_n, \dotp{\vw}{\vx_n}) + \la \vw
\end{align}
%
where $\ell(y,\hat y)$ is some loss function.  Then you update the
weights by $\vw \leftarrow \vw - \eta \vg$.  In this algorithm, in
order to make a \emph{single} update, you have to look at \emph{every}
training example.

When there are billions of training examples, it is a bit silly to
look at every one before doing anything.  Perhaps just on the basis of
the first few examples, you can already start learning something!

Stochastic optimization involves thinking of your training data as a
big distribution over examples.  A draw from this distribution
corresponds to picking some example (uniformly at random) from your
data set.  Viewed this way, the optimization problem becomes a
\concept{stochastic optimization} problem, because you are trying to
optimize some function (say, a regularized linear classifier) over a
probability distribution.  You can derive this intepretation directly
as follows:
%
\begin{align}
\vw^*
&= \arg\max_{\vw} \sum_n \ell(y_n, \dotp{\vw}{\vx_n}) + R(\vw) 
\becauseof{definition}\\
&= \arg\max_{\vw} \sum_n \left[ \ell(y_n, \dotp{\vw}{\vx_n}) + \frac 1 N R(\vw) \right] 
\becauseof{move $R$ inside sum}\\
&= \arg\max_{\vw} \sum_n \left[ \frac 1 N \ell(y_n, \dotp{\vw}{\vx_n}) + \frac 1 {N^2} R(\vw) \right] 
\becauseof{divide through by $N$}\\
&= \arg\max_{\vw} \Ep_{(y,\vx) \sim D} \left[ \ell(y, \dotp{\vw}{\vx}) + \frac 1 N R(\vw) \right] 
\becauseof{write as expectation}\\
& \text{where } D \text{ is the training data distribution}
\end{align}
%

\newalgorithm{opt:sgd}%
  {\FUN{StochasticGradientDescent}($\cF$, $\cD$, \VAR{S}, \VAR{K}, \VAR{$\eta_1$}, \dots)}
  {
\SETST{$\vz\zth$}{$\langle \CON0,\CON0, \dots, \CON0\rangle$}
  \COMMENT{initialize variable we are optimizing}
\FOR{\VAR{k} = \CON{1} \dots \VAR{K}}
\SETST{$D\kth$}{$S$-many random data points from $\cD$}
\SETST{$\vg\kth$}{$\left. \grad_\vz \cF(\VARm D\kth) \right\vert_{\VARm{\vz\kpth}}$}
  \COMMENT{compute gradient on sample}
\SETST{$\vz\kth$}{$\VARm{\vz\kpth} - \VARm{\eta}\VARm{\kth} \VARm{\vg\kth}$}
  \COMMENT{take a step down the gradient}
\ENDFOR
\RETURN \VARm{$\vz\Kth$}
}

Given this framework, you have the following general form of an
optimization problem:
%
\optimizeuc{opt:stochastic}{\vz}{ \Ep_{\ze}[ \cF(\vz, \ze) ]}
%
In the example, $\ze$ denotes the random choice of examples over the
dataset, $\vz$ denotes the weight vector and $\cF(\vw,\ze)$ denotes
the loss on that example \emph{plus} a fraction of the regularizer.

Stochastic optimization problems are formally \emph{harder} than
regular (deterministic) optimization problems because you do not even
get access to exact function values and gradients.  The only access
you have to the function $\cF$ that you wish to optimize are noisy
measurements, governed by the distribution over $\ze$.  Despite this
lack of information, you can still run a gradient-based algorithm,
where you simply compute \emph{local} gradients on a current sample of
data.

More precisely, you can draw a data point at random from your data
set.  This is analogous to drawing a single value $\ze$ from its
distribution.  You can compute the gradient of $\cF$ just at that
point.  In this case of a 2-norm regularized linear model, this is
simply $\vg = \grad_{\vw} \ell(y, \dotp{\vw}{\vx}) + \frac 1 N \vw$,
where $(y,\vx)$ is the random point you selected.  Given this
\emph{estimate} of the gradient (it's an estimate because it's based
on a single random draw), you can take a small gradient step $\vw
\leftarrow \vw - \eta \vg$.

This is the \concept{stochastic gradient descent} algorithm
(\concept{SGD}).  In practice, taking gradients with respect to a
\emph{single} data point might be too myopic.  In such cases, it is
useful to use a small \concept{batch} of data.  Here, you can draw
$10$ random examples from the training data and compute a small
gradient (estimate) based on those examples: $\vg = \sum_{m=1}^{10}
\grad_{\vw} \ell(y_m, \dotp{\vw}{\vx_m}) + \frac {10} N \vw$, where
you need to include $10$ counts of the regularizer.  Popular batch
sizes are $1$ (single points) and $10$.  The generic SGD algorithm is
depicted in Algorithm~\ref{alg:opt:sgd}, which takes $K$-many steps
over batches of $S$-many examples.

In stochastic gradient descent, it is \emph{imperative} to choose good
step sizes.  It is also very important that the steps get smaller over
time at a reasonable slow rate.  In particular, convergence can be
guaranteed for learning rates of the form: $\eta\kth = \frac {\eta_0}
{\sqrt{k}}$, where $\eta_0$ is a fixed, initial step size, typically
$0.01$, $0.1$ or $1$ depending on how quickly you expect the algorithm
to converge.  Unfortunately, in comparisong to gradient descent,
stochastic gradient is quite sensitive to the selection of a good
learning rate.

There is one more practical issues related to the use of SGD as a
learning algorithm: do you \emph{really} select a random point (or
subset of random points) at each step, or do you stream through the
data in order.  The answer is akin to the answer of the same question
for the perceptron algorithm (Chapter~\ref{sec:perc}).  If you do not
permute your data at all, very bad things can happen.  If you
\emph{do} permute your data once and then do multiple passes over that
same permutation, it will converge, but more slowly.  In theory, you
really should permute every iteration.  If your data is small enough
to fit in memory, this is not a big deal: you will only pay for cache
misses.  However, if your data is too large for memory and resides on
a magnetic disk that has a slow seek time, randomly seeking to new
data points for each example is prohibitivly slow, and you will likely
need to forgo permuting the data.  The speed hit in convergence speed
will almost certainly be recovered by the speed gain in not having to
seek on disk routinely.  (Note that the story is very different for
solid state disks, on which random accesses really are quite
efficient.)


\section{Sparse Regularization}

For many learning algorithms, the test-time efficiency is governed by
how many features are used for prediction.  This is one reason
decision trees tend to be among the fastest predictors: they only use
a small number of features.  Especially in cases where the actual
\emph{computation} of these features is expensive, cutting down on the
number that are used at test time can yield huge gains in efficiency.
Moreover, the amount of memory used to make predictions is also
typically governed by the number of features.  (Note: this is
\emph{not} true of kernel methods like support vector machines, in
which the dominant cost is the number of support vectors.)
Furthermore, you may simply \emph{believe} that your learning problem
can be solved with a very small number of features: this is a very
reasonable form of inductive bias.

This is the idea behind sparse models, and in particular, sparse
regularizers.  One of the disadvantages of a 2-norm regularizer for
linear models is that they tend to never produce weights that are
\emph{exactly} zero.  They get close to zero, but never hit it.  To
understand why, as a weight $w_d$ approaches zero, its gradient
\emph{also} approaches zero.  Thus, even if the weight \emph{should}
be zero, it will essentially never get there because of the constantly
shrinking gradient.

This suggests that an alternative regularizer is required to yield a
sparse inductive bias.  An ideal case would be the zero-norm
regularizer, which simply counts the number of non-zero values in a
vector: $\norm{\vw}_0 = \sum_d [w_d \neq 0]$.  If you could minimize
this regularizer, you would be explicitly minimizing the number of
non-zero features.  Unfortunately, not only is the zero-norm
non-convex, it's also discrete.  Optimizing it is NP-hard.

A reasonable middle-ground is the one-norm: $\norm{\vw}_1 = \sum_d
\ab{w_d}$.  It is indeed convex: in fact, it is the tighest $\ell_p$
norm that \emph{is} convex.  Moreover, its gradients do not go to zero
as in the two-norm.  Just as hinge-loss is the tightest convex
upper bound on zero-one error, the one-norm is the tighest convex
upper bound on the zero-norm.

At this point, you should be content.  You can take your subgradient
optimizer for arbitrary functions and plug in the one-norm as a
regularizer.  The one-norm is surely non-differentiable at $w_d = 0$,
but you can simply choose any value in the range $[-1,+1]$ as a
subgradient at that point.  (You should choose zero.)

Unfortunately, this does not quite work the way you might expect.  The
issue is that the gradient might ``overstep'' zero and you will never
end up with a solution that is particularly sparse.  For example, at
the end of one gradient step, you might have $w_3 = 0.6$.  Your
gradient might have $g_6 = 0.8$ and your gradient step (assuming
$\eta=1$) will update so that the new $w_3 = -0.2$.  In the subsequent
iteration, you might have $g_6 = -0.3$ and step to $w_3 = 0.1$.

This observation leads to the idea of \concept{trucated gradients}.
The idea is simple: if you have a gradient that would step you over
$w_d = 0$, then just set $w_d = 0$.  In the easy case when the
learning rate is $1$, this means that if the \emph{sign} of $w_d -
g_d$ is different than the sign of $w_d$ then you truncate the
gradient step and simply set $w_d = 0$.  In other words, $g_d$ should
never be \emph{larger} than $w_d$ Once you incorporate learning
rates, you can express this as:
%
\begin{align}
g_d \leftarrow
\brack{
  g_d & \text{if } w_d > 0 \text{ and } g_d \leq \frac 1 {\eta\kth} w_d\\
  g_d & \text{if } w_d < 0 \text{ and } g_d \geq \frac 1 {\eta\kth} w_d\\
  0   & \text{otherwise}
}
\end{align}
%
This works quite well in the case of subgradient descent.  It works
somewhat less well in the case of \emph{stochastic} subgradient
descent.  The problem that arises in the stochastic case is that
wherever you choose to stop optimizing, you will have just touched a
single example (or small batch of examples), which will increase the
weights for a lot of features, before the regularizer ``has time'' to
shrink them back down to zero.  You will still end up with somewhat
sparse solutions, but not as sparse as they could be.  There are
algorithms for dealing with this situation, but they all have a
heuristic flavor to them and are beyond the scope of this book.

% Although one-norm regularizers are very popular for sparse
% regularization, they are not the only way to achieve sparsity.  An
% alternative regularizer is the KL-regularizer.  The motivation for the
% KL-regularizer differs slightly from the previous notion of sparsity.
% Instead of ``hoping'' that most of the weights go to zero, you instead
% hope that most of the weights tend toward some positive constant $p$.  
% The entropy
% regularizer only works when the weights of your model are
% non-negative, though this is not a big problem in practice: you can
% simply double the number of features and for each $x_d$ include a new
% feature whose value is $-x_d$.  Given \emph{strictly positive} weights
% $\vw$, the entropy regularizer has the form:
% %
% \begin{align}
%   R^\xth{ent}(\vw) &= - \sum_d



\section{Feature Hashing}

As much as speed is a bottleneck in prediction, so often is memory
usage.  If you have a very large number of features, the amount of
memory that it takes to store weights for all of them can become
prohibitive, especially if you wish to run your algorithm on small
devices.  Feature hashing is an incredibly simple technique for
reducing the memory footprint of linear models, with very small
sacrifices in accuracy.

The basic idea is to replace all of your features with hashed versions
of those features, thus reducing your space from $D$-many feature
weights to $P$-many feature weights, where $P$ is the range of the
hash function.  You can actually think of hashing as a (randomized)
feature mapping $\phi : \R^D \fto \R^P$, for some $P \ll D$.  The idea
is as follows.  First, you choose a hash function $h$ whose domain is
$[D] = \{ 1, 2, \dots, D \}$ and whose range is $[P]$.  Then, when you
receive a feature vector $\vx \in \R^D$, you map it to a shorter
feature vector $\hat\vx \in \R^P$.  Algorithmically, you can think of
this mapping as follows:

\begin{enumerate}
  \item Initialize $\hat\vx = \langle 0, 0, \dots, 0 \rangle$
  \item For each $d = 1 \dots D$:
    \begin{enumerate}
      \item Hash $d$ to position $p = h(d)$
      \item Update the $p$th position by adding $x_d$: $\hat x_p \leftarrow \hat\vx_p + x_d$
    \end{enumerate}
  \item Return $\hat\vx$
\end{enumerate}

\noindent
Mathematically, the mapping looks like:
%
\begin{align}
  \phi(\vx)_p
  &= \sum_d [h(d) = p] x_d 
  \quad= \sum_{d \in h\inv(p)} x_d
\end{align}
%
where $h\inv(p) = \{ d : h(d) = p \}$.

In the (unrealistic) case where $P = D$ and $h$ simply encodes a
permutation, then this mapping does not change the learning problem at
all.  All it does is rename all of the features.  In practice, $P \ll
D$ and there will be collisions.  In this context, a collision means
that two features, which are really different, end up looking the same
to the learning algorithm.  For instance, ``is it sunny today?'' and
``did my favorite sports team win last night?'' might get mapped to
the same location after hashing.  The hope is that the learning
algorithm is sufficiently robust to noise that it can handle this case
well.

Consider the kernel defined by this hash mapping.  Namely:
%
\begin{align}
  K\xth{hash}(\vx, \vz) 
  &=
  \dotp{\phi(\vx)}{\phi(\vz)} \\
  &=
  \sum_p \left( \sum_d [h(d) = p] x_d \right) \left( \sum_d [h(d) = p] z_d \right)\\
  &=
  \sum_p \sum_{d,e} [h(d) = p] [h(e) = p] x_d z_e \\
  &=
  \sum_d \sum_{e \in h\inv(h(d))} x_d z_e \\
  &=
  \dotp{\vx}{\vz} + \sum_d \sum_{\substack{e \neq d,\\e \in h\inv(h(d))}} x_d z_e
\end{align}
%
This \concept{hash kernel} has the form of a linear kernel plus a
small number of quadratic terms.  The particular quadratic terms are
exactly those given by collisions of the hash function.

There are two things to notice about this.  The first is that
collisions might not actually be bad things!  In a sense, they're
giving you a little extra representational power.  In particular, if
the hash function happens to select out feature pairs that benefit
from being paired, then you now have a better representation.  The
second is that even if this doesn't happen, the quadratic term in the
kernel has only a small effect on the overall prediction.  In
particular, if you assume that your hash function is pairwise
independent (a common assumption of hash functions), then the
\emph{expected value} of this quadratic term is zero, and its variance
decreases at a rate of $\cO(P^{-2})$.  In other words, if you choose
$P \approx 100$, then the variance is on the order of $0.0001$.

\section{Further Reading}

TODO further reading


\begin{comment}
already know (sub)GD from loss.tex
talk about stochastic (sub)GD
formal optimization and duals
hashing tricks
sparsity
\end{comment}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "courseml"
%%% End: 
